ABSTRACT Divergent natural selection caused by differences in solar exposure has resulted in distinctive 
variations in skin color between human populations. The derived light skin color allele of the SLC24A5 
gene, A111T, predominates in populations of Western Eurasian ancestry. To gain insight into when and 
where this mutation arose, we defined common haplotypes in the genomic region around SLC24A5 across 
diverse human populations and deduced phylogenetic relationships between them. Virtually all chromosomes 
carrying the A111T allele share a single 78-kb haplotype that we call C11, indicating that all instances 
of this mutation in human populations share a common origin. The C11 haplotype was most likely created 
by a crossover between two haplotypes, followed by the A111T mutation. The two parental precursor 
haplotypes are found from East Asia to the Americas but are nearly absent in Africa. The distributions of 
C11 and its parental haplotypes make it most likely that these two last steps occurred between the Middle 
East and the Indian subcontinent, with the A111T mutation occurring after the split between the ancestors 
of Europeans and East Asians. 
Human skin pigmentation varies widely between populations, generally 
decreasing with distance from the equator. According to a hypothesis 
proposed by Loomis (1967) and elaborated by Jablonski and Chaplin 
(2000), decreased exposure to solar ultraviolet at high latitudes 
produces a strong selective advantage for decreased skin pigmentation 
because it permits increased dermal vitamin D synthesis. Consistent 
with this hypothesis, in people of European descent, the pigmentation 
locus SLC24A5 shows strong evidence of selection (Lamason et al. 2005; 
Sabeti et al. 2007; Grossman et al. 2010). A specific coding 
polymorphism in this gene (rs1426654) is a major contributor to the 
pigmentation difference between Africans and Europeans (Lamason et al. 
2005; Stokowski et al. 2007). Frequencies display strong population 
differentiation, with the derived light skin pigmentation allele 
(A111T) fixed or nearly so in all European populations and the 
ancestral allele predominant in sub-Saharan Africa and East Asia 
(Lamason et al. 2005; Norton et al. 2007). The genomic region of 
diminished sequence variation in Europeans spans ~150 kb (Lamason et 
al. 2005). To learn about the time and location of origin of the A111T 
mutation, we studied haplotypes in the region around SLC24A5 across 
world populations. 

MATERIALS AND METHODS 

Population-specific frequency data used for A111T, shown in Figure 1 
and Supporting Information, Table S1, were obtained from the following 
sources: HapMap populations (Altshuler et al. 2010) 
(http://hapmap.ncbi.nlm.nih.gov/); Human Genome Diversity Project 
(HGDP) populations (Norton et al. 2007) or the CEPH database 
(http://www.cephb.fr/en/cephdb/); populations from Sri Lanka (Soejima 
and Koda 2007); ALFRED (Cheung et al. 2000) (alfred.med.yale.edu/alfred/); 
Indian samples (Indian Genome Variation Consortium 2008) 
(http://igvdb.res.in); and additional Mediterranean and South Indian 
samples (Behar et al. 2010). HapMap populations consist of CEU (CEPH 
individuals of northern and western European descent, sampled in Utah), 
TSI (Tuscan, Italy), GIH (Gujarati, sampled in Houston), CHB (Chinese, 
Beijing), CHD (Chinese descent, sampled in Denver), JPT (Japanese, 
Tokyo), YRI (Yoruba, Ibadan, Nigeria), LWK (Luhya, Webuye, Kenya), MKK 
(Maasai, Kinyawa, Kenya), ASW (African-Americans from southwest United 
States), and MEX (Mexican, sampled in Los Angeles). 

Figure 1 World distribution of A111T polymorphism in SLC24A5. The 
origins of sampled populations are indicated (+). Contours of global 
frequencies of SLC24A5 A111T are shaded according to the 
frequency/color scale to the right. Frequency data by population that 
were used for this figure are tabulated in Table S1. 

Haplotypes were derived from several datasets. Haplotypes initially 
were classified by use of the polymorphism data of HapMap phase 3 
(Altshuler et al. 2010). Denser data, such as those of the 1000 Genomes 
Project phase 1 (1000 Genomes Project Consortium 2012), were used to 
define phylogenetic relationships between haplotypes. Clarity also was 
gained by considering haplotypes within blocks of linkage 
disequilibrium, designated A, B, C, and D, and naming longer haplotypes 
as combinations of these local haplotypes. Blocks A, B, and C+D are 
delimited by locations of recombination hotspots (estimated rates 
>5 cM/Mb) deduced from HapMap phase 2 data (International HapMap 
Consortium et al. 2007), whereas the boundary between blocks C and D 
was defined by an interval with modestly elevated recombination 
(~0.8 cM/Mb). Precise boundaries chosen were nt 48321236-48370103 
for A, 48370104-48390052 for B, 48390053-48468019 for C, and 
48468020-48517471 for D, using b37/hg19 coordinates. 
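
The block boundaries above can be written down as a small lookup. The following is only an illustrative sketch: the coordinates are copied from the text, while the function names `block_for` and `block_length_kb` are hypothetical helpers introduced here and are not part of the original analysis.

```python
from typing import Optional

# Inclusive block boundaries in b37/hg19 coordinates, copied from the text.
BLOCKS = {
    "A": (48321236, 48370103),
    "B": (48370104, 48390052),
    "C": (48390053, 48468019),
    "D": (48468020, 48517471),
}


def block_for(position: int) -> Optional[str]:
    """Return the block (A-D) containing a position, or None if it lies outside."""
    for name, (start, end) in BLOCKS.items():
        if start <= position <= end:
            return name
    return None


def block_length_kb(name: str) -> float:
    """Length of a block in kilobases, treating both end coordinates as inclusive."""
    start, end = BLOCKS[name]
    return (end - start + 1) / 1000.0


if __name__ == "__main__":
    for name in BLOCKS:
        print(name, round(block_length_kb(name)), "kb")
    print("B+C+D:", round(sum(block_length_kb(b) for b in "BCD")), "kb")
```

Rounded to the nearest kilobase, these coordinates give block lengths of 49, 20, 78, and 49 kb for A through D, and 147 kb for B, C, and D combined, matching the sizes quoted in Results and Discussion.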

Initial analysis of regions A and B used only single-nucleotide 
polymorphisms (SNPs) from HapMap Phase 3, Release 27 (Altshuler 
et al. 2010) that had been genotyped and phased in each included 
population (11 and 6 SNPs, respectively), whereas for regions C and 
D, SNPs lacking data for the TSI and/or MEX samples also were 
included (16 and 8 SNPs, respectively); these SNPs are listed in Table 
S2. Phased haplotypes were retrieved from HapMap and compared 
with raw genotype data for consistency. For the core region C, the 
absence of genotypes for SNPs c1 and c13 prevented distinguishing 
between members of the C2/C3 haplotype pair (in TSI) and the C6/C7 
pair (in TSI and MEX), respectively. For TSI and GIH, which were 
phased on the basis of CEU, a small number of individuals heterozygous 
for the ancestral allele of A111T (1/1 and 3/8, respectively) were 
misphased. Corrected haplotype assignments are shown in all figures 
and tables except Figure 5 and those in which haplotype combinations 
are displayed (Figure S2, Figure S3, File S2, File S3, and File S4). For 
subregion D, only 1 of 8 SNPs was genotyped in TSI, precluding 
haplotype assignment for this sample. 

A more complete picture of variation was obtained by examining 
phased haplotype information from the 1000 Genomes Project 
(http://mathgen.stats.ox.ac.uk/impute/ALL_1000G_phase1integrated_v3_impute.tgz; 
phase 1 version 3 of March 2012). The core region encompassed 767 
polymorphisms, consisting of 23 biallelic indels and 744 SNPs. Because 
phasing becomes less reliable as allele frequency decreases and is 
undetermined for variants observed only once (287), analysis focused 
on the 156 polymorphisms with minor allele frequencies of 1% or 
greater, corresponding to 291 distinguishable haplotypes. A second 
dense set of SNPs also was compared with HapMap phase 2 data. After 
removal of monomorphic positions, 84 SNPs remained, 18 of which were 
found only in rare haplotypes. Phased haplotypes were retrieved from 
HapMap, Release 21. For phylogenetic analysis, graphs were drawn by 
the use of a simple nearest-neighbor approach and rooted by the use of 
ancestral alleles determined by comparison with other primate 
sequences. The few recombination events were identified by inspection, 
appearing as polymorphisms with transitions on more than one edge. 
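
The frequency filter described above (keeping polymorphisms with minor allele frequency of 1% or greater, then counting distinguishable haplotypes) can be sketched as follows. This is a minimal illustration, not the pipeline used for the paper: the 0/1 string encoding, the function names, and the toy data are all assumptions made for the example.

```python
from typing import List


def filter_by_maf(haplotypes: List[str], min_maf: float = 0.01) -> List[str]:
    """Keep only the sites whose minor allele frequency is >= min_maf.

    Each haplotype is an equal-length string of '0' and '1' (the two alleles
    at each site); this encoding is an assumption made for the example.
    """
    n = len(haplotypes)
    keep = []
    for site in range(len(haplotypes[0])):
        ones = sum(h[site] == "1" for h in haplotypes)
        maf = min(ones, n - ones) / n
        if maf >= min_maf:
            keep.append(site)
    return ["".join(h[s] for s in keep) for h in haplotypes]


def count_distinct(haplotypes: List[str]) -> int:
    """Number of distinguishable haplotypes in the sample."""
    return len(set(haplotypes))


if __name__ == "__main__":
    # Hypothetical toy sample: 5 chromosomes, 3 sites.
    toy = ["000", "000", "010", "010", "011"]
    filtered = filter_by_maf(toy, min_maf=0.25)
    print(count_distinct(toy), count_distinct(filtered))
```

In the toy sample, dropping the monomorphic first site and the low-frequency third site reduces the number of distinguishable haplotypes from three to two, which mirrors the point made in the text: the haplotype count depends on which polymorphisms are used to define them.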

For HGDP sample haplotypes, the data of Li et al. (2008) were 
used. The genotype data contain 13 SNPs in region C that include 8 
SNPs shared with HapMap Phase 3 plus 5 additional SNPs listed in 
Table S2. Phased haplotypes were estimated using Phase 2.1.1 (Stephens 
et al. 2001; Stephens and Donnelly 2003). In this dataset, all haplotypes 
were distinguishable except C6/C7 and C9/C10, due to the absence 
of data for SNP c13 for the former and c2 or c12 for the latter. 
Although the dataset lacks genotypes for SNP c11 (rs1426654) (Li 
et al. 2008), two independently obtained genotypes (Norton et al. 
2007; http://www.cephb.fr/en/hgdp/main.php, laboratory HDCEPH68) 
were available. These were generally consistent with the deduced 
haplotypes (~1% mismatch) and were used to distinguish two haplotype 
pairs that were not otherwise distinguishable (C2 and C11 vs. C3 and 
C9/C10). Similar procedures were used to analyze haplotypes for seven 
Middle Eastern, North African, East African, and South Indian samples 
by the use of the data from Behar et al. (2010). The genotype data 
contain the same set of SNPs available for the HGDP, plus SNP c11 and 
rs4775737. 

Because haplotypes were derived from a variety of datasets, we tested 
both SNPs and deduced haplotypes for consistency. In the genomic 
region analyzed, we found no evidence of errors in allele designation. 
No source reported common haplotypes that did not correspond to 
those found in other datasets. All common core haplotypes in HapMap 
Phase 3 samples were represented by unambiguously phased examples. 
For a small number of individuals, HapMap and 1000 Genomes 
analyses yielded divergent phasing across the B-C region boundary. We 
have avoided drawing conclusions on the basis of rare haplotypes that 
could be artifactual products of genotyping or phasing errors. 

To estimate the age of the A111T mutation, we used a molecular 
clock approach. We first determined the rate of mutation in the 
combined C and D regions from the number of differences between 
human and chimpanzee reference sequences. In this alignment, nt 
changes within 4 nt of gaps were excluded to remove potential biases 
caused by misalignment. For calibration, we used 6 million years, the 
midpoint of the range of estimates (5–7 million years) for the 
divergence time between human and chimpanzee as identified by 
Kumar et al. (2005) and assumed equal mutation rates (per year) in 
the human and chimpanzee lineages. We then counted the number of 
single-nucleotide differences from the modal haplotype for each 
C11-D4-containing chromosome in the 1000 Genomes dataset by using 
all reported variants. Each chromosome provides an imprecise estimate 
of the time since the origin of the haplotype; values were averaged 
over individual populations, or the entire sample. Because 
accumulation of mutations in a single lineage is independent of 
population size, this procedure does not require demographic 
assumptions or data. Although A111T is subject to selection, we assume 
only that subsequent mutations are neutral. Our approach to dating 
selective sweeps differs from that used by Rozas et al. (2001) and 
Meiklejohn et al. (2004), which counts the number of affected sites and 
underestimates the coalescence time unless each sampled lineage is 
independent. In contrast, our estimate is unbiased in the presence of 
nonindependent samples. Because the most frequent C11 + D4 variant 
carrying an additional mutation occurs 36 times in 1013 chromosomes, 
the independence condition is clearly not met. 
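
The arithmetic behind this molecular clock can be sketched in a few lines. The sketch below is one reading of the procedure, not the authors' code: it assumes that human-chimpanzee differences accumulated along both lineages (hence the factor of 2), and the function names and all input numbers are hypothetical placeholders rather than values from the paper.

```python
from typing import Sequence


def regional_rate_per_year(n_differences: float, divergence_years: float = 6e6) -> float:
    """Mutations per year, per lineage, summed over the whole aligned region.

    Differences between the two reference sequences accumulate along both
    lineages, so with equal rates the elapsed time is 2 * divergence_years.
    """
    return n_differences / (2.0 * divergence_years)


def haplotype_age_years(diff_counts: Sequence[int], rate_per_year: float) -> float:
    """Mean number of differences from the modal haplotype, divided by the rate."""
    mean_diffs = sum(diff_counts) / len(diff_counts)
    return mean_diffs / rate_per_year


if __name__ == "__main__":
    # Hypothetical inputs: 1500 human-chimpanzee differences in the region and
    # per-chromosome counts of differences from the modal haplotype.
    rate = regional_rate_per_year(1500)
    counts = [0, 0, 1, 0, 2, 1, 0, 0, 1, 0]
    print(round(haplotype_age_years(counts, rate)))
```

Because each chromosome contributes its own count, averaging over chromosomes does not require them to be independent for the mean to be unbiased; independence matters only for the confidence limits, which is the issue addressed next.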

To estimate confidence limits, we considered the effective 
sample size to be the number of chromosomes that could be 
sampled without violating the assumption of independence. This 
was determined by resampling, with replacement, until a variant 
haplotype was duplicated. The median counts without repeat in 
10,000 replicates ranged from 9 to 14 for individual population 
subsamples, to 19 for the combined sample, substantially lower 
than the total number of chromosomes sequenced. These values, 
multiplied by the observed mutation frequencies, were then used to 
calculate confidence limits under the assumption that mutations 
follow a Poisson distribution. 
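
The resampling step can be illustrated with a short sketch. It rests on assumptions that the text does not spell out: chromosomes are represented by a label that is `None` for the modal haplotype and a variant identifier otherwise, draws of the modal haplotype count toward the total but never trigger a stop, and the count reported is the number of draws made before the duplicating draw. The function names are hypothetical.

```python
import random
import statistics
from typing import Optional, Sequence


def draws_before_repeat(chromosomes: Sequence[Optional[str]], rng: random.Random) -> int:
    """Sample with replacement until a variant haplotype is drawn a second time.

    Returns the number of draws made before the duplicating draw.
    """
    if all(c is None for c in chromosomes):
        raise ValueError("no variant haplotypes present; a repeat can never occur")
    seen = set()
    draws = 0
    while True:
        pick = rng.choice(chromosomes)
        if pick is not None and pick in seen:
            return draws
        if pick is not None:
            seen.add(pick)
        draws += 1


def effective_sample_size(
    chromosomes: Sequence[Optional[str]], replicates: int = 10_000, seed: int = 0
) -> float:
    """Median count without repeat over many resampling replicates."""
    rng = random.Random(seed)
    return statistics.median(
        draws_before_repeat(chromosomes, rng) for _ in range(replicates)
    )


if __name__ == "__main__":
    # Hypothetical sample: mostly modal chromosomes plus a few variant haplotypes.
    sample = [None] * 40 + ["v1"] * 6 + ["v2"] * 3 + ["v3", "v4", "v5"]
    print(effective_sample_size(sample, replicates=2000))
```

The median is used, as in the text, because the distribution of counts without repeat is skewed; the resulting effective sample size would then scale the observed mutation frequency before Poisson confidence limits are computed.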

The aforementioned approach is most applicable when all variants 
in the chromosomal region under study have been determined. At the 
current sequencing depth in 1000 Genomes data (5.1-fold across 
autosomes), we expect some rare variants to be undetected, resulting 
in an underestimate of the age of A111T. We made an approximate 
correction for the underestimate by inverting the estimated power 
relationship (1000 Genomes Project Consortium 2012). In particular, 
the corrected count is given by 

where A_i is the observed number of sites at which variants are 
reported to occur i times in the sample and P_i is the power to detect 
variant sites with i occurrences. This calculation ignores the effects 
of miscounts (e.g., a variant occurring three times is detected but 
inaccurately reported as having two or four occurrences), and 
assumes that the genome-wide power applies to the genomic region 
under investigation. Because the largest contribution is undetected 
singletons (87% of the estimated excess), the assumption that 
miscounts contribute little to the outcome is reasonable. Applying this 
procedure using the SNP frequency spectrum for C11-D4 across all 
samples suggests that the reported age range should be increased by 
a factor of approximately 1.58. 
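
As a rough illustration of what inverting a detection-power relationship can look like, the sketch below scales each frequency class by the reciprocal of its detection power. The exact expression used in the paper is not reproduced in this text, so the form chosen here (summing i * A_i / P_i, that is, weighting each class by its number of occurrences) is an assumption, and the spectrum and power values are invented for the example.

```python
from typing import Dict


def observed_mutation_count(observed_sites: Dict[int, int]) -> float:
    """Total variant occurrences: sum over classes i of i * A_i."""
    return float(sum(i * a_i for i, a_i in observed_sites.items()))


def corrected_mutation_count(observed_sites: Dict[int, int], power: Dict[int, float]) -> float:
    """Power-corrected total: sum over classes i of i * A_i / P_i (assumed form)."""
    total = 0.0
    for i, a_i in observed_sites.items():
        p_i = power[i]
        if not 0.0 < p_i <= 1.0:
            raise ValueError(f"power for class {i} must be in (0, 1]")
        total += i * a_i / p_i
    return total


if __name__ == "__main__":
    # Hypothetical frequency spectrum A_i and detection power P_i.
    a = {1: 20, 2: 10, 3: 5}
    p = {1: 0.25, 2: 0.5, 3: 1.0}
    factor = corrected_mutation_count(a, p) / observed_mutation_count(a)
    print(round(factor, 2))
```

Under any form of this kind, classes with low detection power dominate the correction, which is consistent with the statement that undetected singletons account for most of the estimated excess.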

Comparing different population subsamples, we noted a trend 
toward greater counts in the admixed New World populations. 
Misassignment of rare or unique variants in individuals heterozygous 
for A111T (common in the admixed, but rare in the European samples) 
was a potential explanation of this trend, but the absence of any 
difference in variant frequency between heterozygotes and homozygotes 
in pooled Puerto Rican/Colombian/Mexican samples suggests 
that this was not a significant source of bias. 
Figure 2 Genomic region surrounding SLC24A5. The first three panels show SNPs (red symbols, SNPs used for haplotype analysis in HapMap 
Phase 3, Release 27, also listed in Table S2; black symbols, additional SNPs genotyped in fewer HapMap populations, or flanking SNPs not 
analyzed here). SNPs monomorphic in the original four HapMap population samples (CEU, CHB, JPT, YRI) are omitted. Top, SNP heterozygosity 
in CEU, calculated from allele frequencies, illustrating the region of diminished variation in this European sample. Middle, SNP allele frequency 
difference between CEU and YRI, showing several SNPs with extremely high frequency differentiation (Δf > 0.8); rs1426654 is circled. Beneath is 
a recombination map showing boundaries of blocks used for haplotype analysis (A through D). Region C, as described in text, is the core region 
containing SLC24A5. Bottom, positions of SLC24A5 and other genes in region, with common coordinates (NCBI build B36) used for all panels. 
The centromere is to left. 
Figure 3 Core-region haplotypes defined using 16 SNPs. (A) Diagram 
showing relationships between common haplotypes C1 through C11. 
SNPs that differ between adjacent haplotypes are labeled by nickname 
(SNP c1 through c16, keyed to Table S2). Haplotype C1 contains 
ancestral alleles for each SNP in this series. Haplotype C11 carries the 
derived Thr-111 allele of SLC24A5 (rs1426654, SNP c11). The only 
duplicated transition in this diagram is for SNP c1 (rs1834640). Note 
that analysis of additional SNPs reveals cases in which adjacent 
haplotypes in the diagram do not have ancestor-descendent 
relationships (compare Figure 4). (B) Histograms showing the 
distribution of haplotypes in several HapMap populations. The three 
East Asian populations are combined, as are CEU and TSI. Indeterminate 
C2/C3 in TSI indicated. MEX and ASW are not depicted. The frequency 
scale is split at 0.2. 

RESULTS AND DISCUSSION 

Global distribution of the A111T mutation in SLC24A5 

The geographical distribution of the A111T allele of SLC24A5 (Norton 
et al. 2007), updated with the use of additional population samples 
(Figure 1), shows that A111T is nearly fixed in all of Europe and most 
of the Middle East, extending east to some populations in present-day 
Pakistan and north India. A111T shows a latitudinal decline toward 
the Equator, with high frequencies in Northern Africa (>0.80), 
intermediate (0.4–0.6) in Ethiopia and Somalia, and lower (<0.35) in 
sub-Saharan Africa. This pattern is broadly consistent with strong 
positive selection for decreased skin pigmentation throughout Europe. 
There is a cline of decreasing frequency of A111T in indigenous 
populations east of approximately longitude 75° in Central Asia, with 
near-absence in East Asia, Oceania, and the Americas. The extent to 
which the spread of A111T to the east has been inhibited by the 
absence of substantial eastward population migrations postdating its 
origin or by the presence of other loci responsible for decreased skin 
pigmentation in East Asia is presently unclear. 

Characterization of haplotypes in the genomic region 
encompassing SLC24A5 

Diminished variation in the genomic region around SLC24A5 in the 
HapMap CEU (European ancestry) sample led us to ask what the 
haplotypes associated with the A111T allele looked like, and how, 
when, and where they might have arisen. We therefore investigated 
haplotypes spanning this genomic region (Figure 2). The haplotypes 
are described in the context of four contiguous subregions defined by 
blocks of linkage disequilibrium, here designated A (49 kb), B (20 kb), 
C (78 kb), and D (49 kb) (Figure 2). Blocks B, C, and D together 
encompass the region of diminished variation in CEU. Analysis of the 
core subregion C, which includes SLC24A5, yielded 46 haplotypes in 
HapMap Phase 3 populations (Table S3 and Table S4). The 11 
haplotypes with individual abundances >0.5%, which we designate C1 
through C11, collectively comprise 93-98% of the total in each 
population (Figure 3, Table 1, and Table 2). A single haplotype, C11, 
accounts for 97% of all instances of the A111T variant of SLC24A5. 
Most of the haplotypes with frequencies <0.5% appear to be products 
of recombination between more frequent haplotypes. Analysis of 
common haplotypes found in HGDP and other samples (Li et al. 2008; 
Behar et al. 2010) yielded results matching those derived from 
HapMap samples (Table S5 and Table S6), including the equivalence 
between haplotype C11 and the derived allele of rs1426654. This 
finding is consistent with a common origin for A111T worldwide. 
Analysis of data from the 1000 Genomes Project indicated that 
haplotypes defined on the basis of 16 SNPs corresponded to sets of 
closely related haplotypes (Figure 4 and File S1). Common haplotypes 
defined using 1000 Genomes data differed from the ancestral state at 
27–48 positions. The number of haplotypes detected depends on the 
number of polymorphisms used to define them; with inclusion of lower 
frequency variants, an increasing fraction of chromosomes corresponds 
to rare haplotypes (Table S7). 

Phylogenetic relationships between core haplotypes 

Phylogenetic relationships between haplotypes determined using 1000 
Genomes data (Figure 4) were equivalent to those deduced from 
HapMap phase 2 data. The branches comprising C1, C2, C3, C4, 
and C5-C11 are early diverging clades. The most abundant and 
extensive lineage includes three branches: C5, C6-C7, and C9-C11. 
Within the C9-C11 branch, the C9 cluster is ancestral to C10, but 
nearly all instances of C9 include additional polymorphisms that 
distinguish them from common ancestors with C10. In contrast, the 
most commonly observed C10 variant is ancestral to C11. 

Identification of a 147-kb founder haplotype 
containing A111T 

The region of diminished variation in Europeans includes subregions 
B and D, flanking the core region (Figure 2). B- and D-region 
haplotypes are summarized in Figure S1, A and B and Table S8, Table S9, 
Table S10, and Table S11. It is apparent that a variety of combinations 
of B- and C-region haplotypes arose by recombination (Figure 5 and 
File S2). The smaller number of combinations of C- and D-region 
haplotypes observed (Figure S2 and File S3) is consistent with the 
occurrence of fewer crossovers near this location, including one 
ancestral to C10, C11, and a subset of C9. The B- and D-subregion 
haplotypes associated most strongly with haplotype C11 are B6 
(94%) and D4 (>99%), respectively (Figure 5, Figure S2, and File 
S2 and File S3). This pattern also holds in HapMap populations in 
which C11 is far from fixation and in which a diversity of other B- and 
D-region haplotypes not associated with C11 are found, including 
MKK. Interestingly, the greatest diversity of B-region haplotypes 
associated with C11 is found in GIH (89% B6). Taken together, these 
results establish that the 147-kb founder haplotype containing A111T 
was B6 + C11 + D4. 

Subregion A, to the left of subregion B, lies outside the region of 
diminished variation around SLC24A5 observed in Europeans (Lamason 
et al. 2005), indicating that recombination occurred between the A 
and B regions after the origin of A111T but before its fixation. An 
analysis of the A subregion is shown in Figure S1C, Table S12, and 
Table S13. Two haplotypes, A1 and A5, predominate (together totaling 
89-99%) in association with C11 (Figure S3 and File S4). The 
relative proportions of A1 and A5 associated with B6 + C11 vary 
considerably among populations (Table S14), presumably a result of 
genetic drift. The data do not allow a determination of whether A1 or 
A5 was a part of the founding A111T haplotype. 

Recombination was involved in the creation of C11 

Haplotypes C3 and C11 share the derived allele of SNP c1 (Table 1), 
suggesting the possibility of recombination. To test this possibility, we 
examined SNPs not genotyped in HapMap Phase 3. Haplotype C11 
carries ancestral alleles of SNPs rs12441154 and rs57108441, whereas 
C10 carries the derived alleles, a pattern readily explained by a single 
crossover between C3 and C10 (Figure 6). In support of this notion, 
the B-region haplotype found associated with C11, B6, is also the one 
most commonly associated with C3; conversely 96% of C10 haplotypes 
are associated with B region haplotypes other than B6 (68% with 
B2; File S2). 

Table 2 Distribution of common core-region haplotypes in HapMap populations 

Can we determine whether recombination involving C3 preceded 
or followed the mutation that created A111T? Models in which the 
recombination or mutation occurred first produce the same end product 
but proceed through different intermediates, corresponding to C26 
or C22, respectively (Table S3). Rare haplotypes matching both 
potential C11 precursors were found. Because either could have been 
produced by recombination subsequent to the origin of C11, their 
occasional occurrence is not informative. However, an evolutionary 
argument strongly suggests an order of events (Figure S4). If the 
crossover predated the mutation, the predicted intermediate (C26) 
would not have experienced positive selection on the basis of lighter 
pigmentation (Figure S4A). Selection for decreased skin pigmentation 
would cause the predominant haplotype containing A111T to be C11, 
as is observed. On the other hand, if the A111T mutation preceded the 
crossover, the intermediate haplotype (C22) would be predicted to 
experience the same selective pressure as C11 (Figure S4B). Because 
C11 is derived by recombination between C3 and C22 in this model, 
C22 would be expected to predominate over C11, unless C11 had 
a selective advantage over C22. This outcome is not what is observed. 
Rather, the frequency of C22 is only approximately 1% that of C11. 
Furthermore, association with diverse B-region haplotypes rather than 
one makes it most likely that the existing instances of C22 are the 
result of recombination after the formation of C11 rather than relicts 
of a precursor to C11. We conclude that the crossover most likely 
preceded the A111T mutation. 
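
The single-crossover reasoning used in this section (a haplotype whose alleles match one parent on the left and another parent on the right) can be checked mechanically. The sketch below is illustrative only: haplotypes are encoded as strings of '0' (ancestral) and '1' (derived) ordered by position, the function name is hypothetical, and the toy allele strings are invented rather than taken from the C3, C10, or C11 data.

```python
from typing import List


def single_crossover_breakpoints(child: str, left: str, right: str) -> List[int]:
    """Return every interior k such that child == left[:k] + right[k:].

    k = 0 and k = len(child) are excluded because they would mean the child
    is simply a copy of one parent rather than a recombinant.
    """
    if not (len(child) == len(left) == len(right)):
        raise ValueError("haplotypes must have equal length")
    return [k for k in range(1, len(child)) if child == left[:k] + right[k:]]


if __name__ == "__main__":
    # Hypothetical parental haplotypes and a candidate recombinant.
    left_parent = "01100"
    right_parent = "10011"
    candidate = "01111"
    print(single_crossover_breakpoints(candidate, left_parent, right_parent))
```

A non-empty result means the candidate is compatible with one crossover between the two parents, with the breakpoint lying between sites k-1 and k; it does not by itself establish the order of crossover and mutation, which is why the text turns to the evolutionary argument.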

Phylogeographic analysis of SLC24A5 
haplotype distributions 

The world distribution of core region haplotypes, together with their 
phylogenetic relationships, suggests which haplotypes likely originated 
in Africa and which most likely arose outside of Africa. As 
expected from the near fixation of A111T in Europe, the C11 clade 
predominates there, and all other haplotypes are rare. Of the 
remaining 10 common core haplotype groups, all ancestral at 
rs1426654, eight clearly have their origins in Africa (Figure 3B, 
Figure 4, and Table S4). Three early diverging haplotypes, C1, 
C2, and C4, are rare outside of Africa and clearly originated there. 
In the lineage containing the majority of haplotypes, each of the 
three branches, containing C5, C6-C7, and C8-C11, gives strong 
evidence of having originated in Africa. C5 reaches its greatest 
abundance in West Africa and is rare outside of Africa. Within 
the other two branches, C6 and C9, which are the most common 
haplotypes in Africa, are also common worldwide, whereas C7 is 
abundant in East Asia and much less common but widespread in 
Africa. Consideration of the relationships among haplotype variants 
(Figure 4) indicates that C6, C7, and C9 (but not C8) dispersed 
out of Africa and have diverse descendants present and 
originating in East Asia. Among these descendants is C10, which 
is abundant in East Asia (and the New World) but extremely rare 
in Africa (0.5% in LWK). Haplotype C3 represents the final early 
diverging lineage (Figure 4). Although the lineage containing this 
haplotype must have originated in Africa, C3 is rare in Africa 
(1.0% in MKK) but widely distributed in East Asia, the New 
World, and Oceania. The distributions of C3 and C10 are most 
consistent with origin outside of Africa and subsequent introduction 
into Africa by migrations such as those documented by uniparental 
markers (Richards et al. 2006). 

Figure 4 Phylogenetic relationships among core haplotype clusters. 
Haplotypes were deduced from 1000 Genomes data by the use of 156 
polymorphisms with frequency >1%. (A) Haplotypes having frequencies 
>0.5% (representing 78% of total) are depicted. Complete dataset is 
found in File S1. Sets of haplotypes corresponding to those deduced 
using 16 SNPs are enclosed within red boxes and labeled C1 through 
C11. C8, which constituted <0.5%, is not shown. Two variants of C3 
differ at an indel that shows evidence of recurrent mutation and for 
which the ancestral state is unknown. Values within each circle 
indicate number of occurrences. Gray and yellow shading indicates 
haplotypes that are predominantly African or East Asian (in this 
dataset), respectively. Numbers of polymorphic differences on each 
branch are indicated if greater than one. Inclusion of lower frequency 
polymorphisms would raise the number of C1-specific variants to 26, 
whereas C4-specific variants would increase to 11. Early haplotype 
lineage divergence, depicted within the dashed circle, provides 
evidence for multiple recombination events that are not resolved in 
this phylogeny. The numbers within the dashed circle represent number 
of polymorphic differences from the ancestral state; this involves 31 
polymorphisms that have derived alleles shared by more than one 
lineage. (B) Relationships among early diverging branches supported by 
21 polymorphisms. (C) Alternative relationships among early diverging 
branches supported by 5 polymorphisms. The ancestral state is 
represented by the open circle in B and C. 

Figure 5 Relationships between local haplotypes in B and C regions. 
For each HapMap population, the distribution of haplotype combinations 
is shown as a heat map (scale on right). Recurrent recombination 
between core- and B-region is apparent. Predominant association of C11 
with B6 contrasts with associations of C10 (B2) and C9 (B2, B7). Counts 
are shown in File S2. [Panels: CEU, MKK, CHB, TSI, GIH, LWK, YRI, CHD, 
JPT; rows B1 through B7; color scale labeled "fraction," from 0.02 to 
1.0.] 

2064 | V. A. Canfield et al. 

Genes | Genomes | Genetics 



Can we date the A111T mutation? 

The preceding analysis is consistent with a wide range of possible 
dates for the origin of All IT, including the period before the initial 
colonization of Europe by anatomically modern humans >40 thou- 
sand years ago (kya) (Mellars 2006). An estimate for the date of origin 
of All IT based on microsatellites (Beleza et al 2012) places the origin 
at 19 kya (95% confidence interval 6—38 kya), for a dominant model, 
or 11 kya (95% confidence interval 1 — 56 kya), for a more plausible 
additive model. To create an independent estimate, we applied a mo- 
lecular clock approach to 1000 Genomes data by using the combined 
C and D subregions. Because proportions of different classes of nu- 
cleotide substitutions in the Cll + D4 variants and in the human- 
chimpanzee alignment are not significantly different (x 2 = 4.42, df = 5, 
P = 0.49; Table SI 5), we combined these classes for analysis. For the 
combined population samples, before making corrections for under- 
counts in the source data, we obtained an estimate of 7.8 kya for the 
most recent common ancestor of the Cll + D4 haplotype combina- 
tion (Table 3). Corresponding 95% confidence limits are 4.8—12.2 
kya, whereas uncorrected estimates derived from individual European 
samples or the combined New World samples (also of European 
origin) ranged from 5.2 to 10.4 kya (Table 3). These values are clearly 
underestimates as a result of low sequence depth (1000 Genomes 
Project Consortium 2012). Adjustment for undercounting is substan- 
tial, increasing the estimated age for the combined samples to 12.4 
(95% confidence interval 7.6—19.2) kya. If mutation rates in recent 
humans are lower than predicted from the human- chimpanzee di- 
vergence (Scally and Durbin 2012), true ages will be even older. Our 
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Figure 6 Generation of haplotype C1 1 . Recombination between 
chromosomes containing C10 (white) and C3 (shaded) is indicated 
by X. The crossover product represented by the solid line is the 
precursor to C11; the reciprocal recombinant represented by the 
dotted line is not recovered. Ancestral alleles are represented by open 
symbols, and derived alleles are solid black, where rs1 2441 154, 
rs1 834640 (d), rs2675345 (c2), rs57 108441, and rs1 426654 (c11) are 
represented by the triangle, square, triangle, circle, and star, re- 
spectively. The B6 haplotype (here defined by 6 SNPs), just to the left 
of the C region, is that most commonly found in phase with the C3 
allele, as indicated; B2 and B3 are most commonly in phase with C10. 
Subsequent mutation at rs1 426654, represented by the black star, 
produced the predominant SLC24A5 A1 1 7 T -conta in in g haplotype, C11, 
which is globally associated with B6. 1000 Genomes Project data lo- 
calize the crossover to the 0.9-kb interval between rs571 08441 and 
rs78729596. Flanking regions are not shown. Note that the final prod- 
uct is identical under an alternate model in which mutation precedes 
recombination. 



adjusted dates overlap those previously reported (Beleza et al. 2012) 
and are also consistent with the lower limit for the origin of A111T set 
by the finding that the Alpine "iceman" dated to 5.3 kya was homozygous 
for this variant (Keller et al. 2012). This date range implies an 
origin clearly preceding the Neolithic transition in Europe. These dates 
are later than the initial colonization of Europe but are consistent with 
an A111T origin before or after post-glacial population expansions. 
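
The undercounting adjustment can be checked with simple arithmetic. The following is a minimal sketch, assuming the correction is applied as a single multiplicative factor (the 1.58 given in the Table 3 footnote for the combined sample) to each uncorrected confidence limit; the function and variable names are illustrative and not taken from the original analysis.

```python
# Correction factor quoted in the Table 3 footnote for the combined sample.
UNDERCOUNT_FACTOR = 1.58

# Age of the Alpine "iceman" (kya), which sets a lower limit on the age of the variant.
ICEMAN_KYA = 5.3


def adjust_for_undercounting(age_kya, factor=UNDERCOUNT_FACTOR):
    """Scale an uncorrected age estimate (kya) by a multiplicative factor.

    Assumption: the adjustment is a plain rescaling of the tabulated date.
    """
    return age_kya * factor


# Uncorrected 95% confidence limits (kya) as quoted in the text.
uncorrected_limits = (4.8, 12.2)

adjusted_limits = tuple(
    round(adjust_for_undercounting(limit), 1) for limit in uncorrected_limits
)
print(adjusted_limits)

# The adjusted lower limit should still be older than the iceman.
print(adjusted_limits[0] > ICEMAN_KYA)
```

Rescaling the rounded limits gives roughly 7.6 and 19.3 kya, close to the reported adjusted interval of 7.6-19.2 kya; the small difference in the upper limit is what one would expect if the published values were computed from unrounded inputs.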

Obtaining a better date for the origin of the A111T mutation is 
challenged by a number of issues. Our approach provides a date for 
the common ancestor of the sampled C11-D4-containing chromosomes. 
This common ancestor may be significantly younger than the 

Table 3 A111T date estimation using 1000 Genomes Project data 



origin of A111T if positive selection was initially weak or nonexistent, 
or if there was a subsequent bottleneck. In addition, our date estimation 
relies on samples of predominantly European origin. Inclusion of 
Middle Eastern or South Asian examples would be expected to yield 
a more representative result. Incomplete detection of rare variants is 
a further limitation that higher-coverage sequencing could mitigate. 
More direct limits on the age of A111T could result from genotyping 
of ancient human DNA. 

Where did C11 originate? 

The precursors to C11, haplotypes C3 and C10, are common in East 
Asia and the New World (Figure S5), but the distribution of C11 
indicates that these locations are not likely sites for the origin of 
C11 or its immediate precursor. Similarly, B6 not associated with 
C11 is distributed widely in East Asia and the New World (data not 
shown). The paucity of C3 and C10 among existing African haplotypes 
suggests that both events leading to the origin of C11 took place 
outside this continent. Our dating for this haplotype is consistent with 
a non-African origin. The most likely location for the origin of C11 is, 
therefore, within the region in which it is fixed or nearly so. As both 
models for the origin of C11 imply that C3 and C10 were present in 
ancestors of Europeans, the observed and inferred distributions of 
these autosomal haplotypes are consistent with the single-out-of-Africa 
hypothesis derived using uniparental markers (Oppenheimer 
2003; Macaulay et al. 2005). 

The presence in Africa of A111T only in association with C11 
indicates that the observed examples, like those of C3 and C10, resulted 
from introduction into the continent subsequent to origin. The low 
diversity of B-region haplotypes associated with C11 in MKK, equivalent 
to that seen in European samples (Figure 5 and File S2), supports 
this view because those individuals live among a majority population 
with high B-region diversity. Although too few African C11 sequences 
have been determined to draw strong conclusions, those available from 
the 1000 Genomes Project show no evidence of greater age in the form 
of greater SNP diversity than the European examples. It should be 
noted that the relatively high abundance of A111T in several equatorial 
East African samples indicates the absence of sustained strong negative 
selection against this allele at low latitudes. 

Although a non-African origin for C11 is clear, near fixation of this 
haplotype over a wide geographical region prevents strong inferences 
regarding a precise location of origin. Existing data are consistent with 
a model in which the C11 precursor did not extend outside the 
geographical region in which C11 is now nearly fixed, a conclusion 
subject to limited haplotype sampling in some neighboring regions, 
such as India. With sufficiently strong positive selection for C11, it is 
possible that this haplotype could have originated anywhere within its 
current range and spread via local migration. However, selection acting 

Human-chimpanzee alignment contains 1227 differences in an aligned length of 125,531 nt, of a total of 127,419 nt in the C+D region. When a divergence date of 6 million years 
ago is used, this corresponds to a per-site mutation rate of 8.1 × 10⁻¹⁰ year⁻¹. Dates shown are not corrected for undercounting of rare polymorphisms. For the combined 
C11 sample, the distribution of variants suggests that tabulated dates should be multiplied by a factor of 1.58. kya, thousand years ago. 
Estimated effective sample sizes ranged from 9 to 13 for individual admixed New World samples. 
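
The per-site rate quoted in the footnote above can be reproduced from the alignment counts. This is a hedged sketch: the factor of 2 (differences accumulating along both the human and chimpanzee lineages since divergence) is an assumption that happens to recover the stated value, and the function name is illustrative.

```python
def per_site_rate(differences, aligned_length, divergence_years):
    """Per-site, per-year mutation rate from pairwise sequence divergence.

    Assumption: differences accumulate independently on both lineages,
    so the elapsed branch length is twice the divergence time.
    """
    return differences / aligned_length / (2 * divergence_years)


# Counts and divergence date as given in the Table 3 footnote.
rate = per_site_rate(
    differences=1227,
    aligned_length=125_531,
    divergence_years=6_000_000,
)
print(f"{rate:.1e} per site per year")
```

A lower assumed mutation rate would make every age estimate proportionally older, which is the sensitivity the main text points to when citing Scally and Durbin (2012).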
in concert with major population migrations would have facilitated 
a much more rapid dispersal. Archeological, mitochondrial, and 
Y-chromosomal data suggest involvement of multiple dispersals in 
shaping the current populations of Europe and the Middle East 
(Soares et al. 2010). Because A111T is far from fixation in most Indian 
samples (Table S1), the high diversity of B-region haplotypes associated 
with C11 in the GIH sample may be the result of prolonged recombination 
rather than early arrival of A111T. In fact, the decrease in 
frequency of A111T to the east of Pakistan suggests that C11 originated 
farther to the west and after the initial genetic split between western 
and eastern Eurasians. On this basis, we consider an origin of 
C11 in the Middle East, broadly defined, to be most likely. 

Traditionally, uniparental markers have been used for the construc- 
tion of phylogenetic trees due to the absence of recombination. Here, 
autosomal haplotypes and the characterization of recombination events 
have helped us to define the genotypic phylogeny of a genomic region 
known to have been subject to strong natural selection. Such an approach 
to studying the phylogeny of autosomal regions may also be useful for the 
study of other loci under selection in humans and in other organisms. 